ABSTRACT

Many problems in the areas of test interpretation and
educational assessment are causing difficulties for educators. On one
hand the public and legislators are requesting more state testing
programs and assessment programs, while on the other, educators
realize the problems concerning testing and test interpretation.
Difficulties arise when tests are misinterpreted and misused. A
proposed moratorium by the National Education Association is not the
answer to the problem since it would destroy the continuum of data
and create a critical information gap. Reporting systems based on
criterion referenced measurement, the use of computers to find
patterns from which to generate interpretations, and further use of
adjusted scores can help to alleviate some of the problems. A
moratorium on testing would only destroy the continuum of data and
create a critical information gap. (Author/SB)
Along with all the other crises we're bombarded with,
here comes the "measurement crisis." Dr. Coffman
describes the conflicts besetting the field: Public cries for
accountability, but educators' fears of testing what is easy
to measure rather than what is important to know.
Legislation of state testing programs, on the one hand, and, on
the other, the call by the National Education Association
for a moratorium on the use of standardized tests in
schools. And the always present difficulties of test
interpretation leading, too often, to misinterpretation and
misuse of results.

The proposed NEA moratorium, Dr. Coffman
believes, is not the answer to the problems. He discusses as
possible solutions improved reporting systems, perhaps
based on criterion referenced measurement (more broadly
developed than at present, however, he argues); perhaps
making greater use of computers to find patterns from
which to generate interpretations; continued exploration
of the use of "adjusted scores."

Dr. Coffman concludes with his idea of the kind of
moratorium which would be beneficial to students,
educators, and parents: not a halt to testing, but a halt to
misuse of tests and test results.

Dr. Coffman, 1972 NCME president, knows his field
from several angles: high school teacher and principal, and
student and developer in the test and measurement field.
Now Lindquist Professor of Education and Director of the
Iowa Testing Programs at the University of Iowa, he was
associated with the test development program at the
Educational Testing Service in Princeton, N. J., from 1952
to 1969, serving as associate director and then director of
test development, and subsequently as director of research
and development, College Board Programs Division. He is
currently a member of the Analysis Advisory Committee,
National Assessment of Educational Progress, and a
consultant for other research and measurement projects.



A MORATORIUM? 
WHAT KIND?* 

William E. Coffman 

These are critical times for educational measurement
and evaluation. On the one hand, the public is
deeply concerned about the quality of education and is
demanding that educators become more accountable
for the quality of their programs. On the other hand,
educators are concerned that increasing emphasis on
standardized tests is encouraging teachers to emphasize
those outcomes of teaching that are easy to
measure and to neglect those outcomes that are hard
to measure, or for which no standardized measures
exist. On the one hand, legislators are passing laws
calling for the establishment of state testing programs.
On the other hand, the National Education Association
is calling for a moratorium on the use of standardized
tests in the schools.

To the layman, it seems only natural to use
standardized tests to find out what the schools are
accomplishing. After all, aren't the measures able to tell us
just where each child stands in relation to the norms?
Don't high scores mean we have good schools and low
scores that we have poor schools? To the teacher
trained in educational measurement, the interpretation
of standardized test scores is not so simple. Groups of
children differ widely in what they know when they
first come to school, and what one can teach a child in
a year depends to a considerable degree on how much
he knows at the beginning. Furthermore, there's the
problem of bias in tests. No standardized test covers all
the things a school is trying to teach, and sometimes
the tests include things the school, for some very
acceptable reason, has decided not to emphasize.

Tests are commonly used in other professional
areas, but there is less likelihood than in the
educational setting of their results being misinterpreted. For
example, one isn't likely to oversimplify the job of
interpreting the results of the usual medical test.



*This paper is a modified version of the Presidential Address at the
annual meeting of the National Council on Measurement in Education,
New Orleans, February 27, 1973.

Would anybody think that the doctor with the
healthiest patients was necessarily the best doctor?
Certainly not, if he stopped to consider that the most
seriously ill patients would be most likely to seek out
the physician with the best reputation. Do patients
generally demand that the doctor tell them the results
of their medical tests in numbers? Not often.
Generally, a patient expects his physician to tell him
what the results mean; he isn't likely to make the
mistake of thinking that he knows what the numbers
mean. In contrast, nearly everybody feels confident he
knows what a grade score of 6.3 on a reading test
means. This is unfortunate, for the meaning of a score
on an educational test is every bit as difficult to
interpret as the numbers used in recording the results
of a medical test.

DISTORTED EXPECTATIONS 

I don't think I'm misrepresenting the situation
when I say that there is today a serious tendency to
oversimplify the interpretation and use of educational
tests. Certainly, the lay public, even the elite of that
public, have distorted conceptions of what educational
tests can do. As Dyer (1973) reported:

In 1971 the education committee of one of the state
legislatures came up with an educational accountability
bill that read in part as follows: "If the performance of any
school district on any test approved by the state board of
education does not equal or exceed the national
performance average for such a test for two successive years,
said school district shall not receive any further state
financial assistance until such time as said school district
has achieved national performance average."

The fact that the bill did not pass suggests that
legislators can be brought to see the error of their ways;
but the job of communicating, not only to the lay
public but even to the users of tests within the
profession, is a monumental one. Even a superficial search
through the testing literature of the last half century
will turn up papers in every decade calling attention to
the extensive misuse of test results on the part of the
profession, and as Dyer eloquently points out, the
gap has been widening between the increasing
sophistication of the test makers and the understanding of
the test user, who just can't find the time to keep up
with the developing literature. We're simply not going
to educate the test user to all the subtleties of test
interpretation; we're going to have to design more
fool-proof reporting systems.

Incidentally, don't think that the problem of
keeping informed is one that hounds only the
educational profession. Ask your general practitioner how
much time he has to keep up with the medical
literature and to learn to interpret the results of the
hundreds of new tests that are becoming available all
the time. Or ask your engineer friend how easy it is to
maintain competence in a field where the knowledge
explosion is rendering today's competence obsolete in
five years.

ENOUGH, ALREADY! 

It may have been, at least in part, a recognition of
this widening gap that motivated the NEA to approve
the resolution calling for a moratorium on
standardized testing. If I read correctly the reports of the
discussion that preceded the voting and the
explanations that have followed, there was no intention of
condemning out of hand the use of tests to monitor
the progress of children through the educational
system. In fact, there seemed to be a clear recognition
of the responsibility of a school system to render an
accounting of what it has been up to.

Furthermore, another resolution approved at the
same time called upon superintendents and boards of
education to refuse to hire graduates of institutions
that failed to include in their program of teacher
education training in the interpretation and use of
standardized tests.

The delegates seemed to be saying, "The gap
between what the test makers know about the
limitations and possibilities in standardized test scores and
what the users are able to apply has become so wide
that more harm than good is resulting from the use of
standardized tests in the schools. Let's put a stop to it
all until the test makers can come up with packages
less subject to misuse or until the profession can
develop the sophistication needed to prevent serious
misinterpretations."

DRAWBACKS OF MORATORIUM 

I'm afraid, however, that the solution proposed by
the NEA is too drastic. Can we honestly say that we
ought to take away from the thousands of knowledgeable
and sensitive test users information on which they
have come to depend? Or to create a gap in the
accumulated record of pupils' progress through the
educational system that may some day permit insightful
researchers to create a picture of the ebb and flow
of the educational tides of the 1970's?

In many instances we're just beginning to learn how
to organize our accumulating data files so that the
hidden relationships can be worked out. To abandon
the data collection because somebody might misuse it
would be a mistake of the first magnitude, for it is
primarily through a study of patterns of change in test
performance over time as changes occur in the inputs
and treatments that the researcher is able to form
judgments about what may be happening and why.
And it won't do simply to select a representative
sample of schools on which to collect data. To some
extent each school system is unique, and to understand
what is going on in a school system, it's necessary to
have a finger on the pulse of that system, not simply to
apply blindly what has been learned from studying
another system.

IMPROVE REPORTING SYSTEMS

Some diagnosticians have suggested that the first
thing that needs to be done is to get rid of reporting
systems that are subject to misuse. For example, the
current draft of the "Standards for the Development
and Use of Educational and Psychological Tests"
includes this recommendation:

Interpretive scores which lend themselves to gross
misinterpretation, such as mental age or grade equivalent,
should be abandoned or their use discouraged. VERY
DESIRABLE (APA, 1973)

In general, I think I agree with this recommendation.
Even people with considerable experience in dealing
with quantitative data can be misled by grade
equivalent scores. For example, the report on the study of
equality of educational opportunity prepared by James
Coleman (1966) and his associates talks about blacks
getting further and further behind as they go through
school, and the statement is repeated in the recent
book by Mosteller and Moynihan (1972).

On the other hand, teachers report, with some
justification, I think, that the grade score does give
some hint of the level at which one should begin
teaching a child, a kind of information not readily
inferred from either percentile ranks or standard
scores, even those based on a longitudinal growth
model. If we are to introduce numbers that permit us
to chart the growth of children through the
educational system and at the same time avoid the dangers
inherent in a grade equivalent system, we're going to
have to do two things: (1) provide a way of going from
the new system back to the old in order to preserve the
continuity of the record, and (2) provide a means of
bridging the gap between the numbers and the
decisions they imply about what we should be doing
with individual pupils as a result of knowing their test
scores.

CRITERION REFERENCED MEASUREMENT

I suppose the obvious solution to providing
interpretation is to provide a criterion referenced
interpretation, and let me say that this isn't nearly so
modern an idea as some people would have us believe.
Even before 1920 E. L. Thorndike was examining the
problem of educational measurement in the broader
context of measurement in the sciences and concluding
that educational measurement would follow physical
measurement in developing a system where each score
had an immediately meaningful referent. From our
perspective of the 1970's we can see that it was
probably a mistake to develop scores that answered the
question, "What group is this person's performance
like the average of?" The discussion in recent years of
criterion referenced measurement has, on the whole, I
think, been salutary. At the same time, it has generated
some output that is potentially as dangerous to
education as grade equivalent scores.

The problem, as Krathwohl and Payne so clearly
point out in their chapter in the second edition of
Educational Measurement (1971), involves the conflict
between skills which are important to learn, and skills
which are most easy to teach and to measure. The
kinds of learnings that are most important are those
that involve complex skills and understandings and
thus that develop slowly over the years. The learnings
that respond most rapidly to instruction and are easily
demonstrated through responses to test items involve
simple skills and recall of information that serve
primarily as vehicles for the development of more
complex learning. To the extent that measurement
deals only with the simple learnings and ignores the
more complex, it encourages the training of simple
responses without emphasizing the fact that the way in
which the simple learnings are developed may be
crucial for the development of the more complex ones.
Thus, to specify, as some have, that the criterion of an
acceptable item for a criterion-referenced test is that it
be responsive to teaching (over the short pull), is to
reinforce a superficial concept of what education is all
about.



SIMPLE SKILLS ESSENTIAL 

This is not to say that there is no place for the
development of simple skills. The problem of breaking
the code in beginning reading is one that involves
hundreds, if not thousands, of specific learnings. One
reason there are so many failures in learning to break
the code is simply that the task of checking to see that
a particular child has noticed and understood the
significance of each one of these critical elements is a
monumental one for any teacher faced with 20 to 30
squirming first graders. Any system that will help the
teacher to determine that the many messages have
been received and responded to is likely to improve the
teaching of beginning reading.

But let's not be confused. The reason for giving
serious attention to instruction in reading is to get the
child ready for the task of taking over, bit by bit, the
responsibility, under the guidance of the teacher, of his
own education. And one task of a good testing
program is to chart the progress of the pupil toward
this goal. Before long, the school must get on with the
task of helping pupils learn how to use what they are
learning in complex meaningful contexts, and the
testing program must help professional educators know
whether or not this task is being accomplished.

Most of the criterion-referenced tests coming on the
market today seem to be concerned primarily with the
assessment of progress in the building of a repertory of
basic information and simple skills. Before long, we're
going to face the necessity of providing criterion
references for more complex learnings, or else serious
educators will judge our testing profession irrelevant.

POSSIBLE SOLUTIONS 

In this connection, I find the 1962 paper of Bob
Ebel highly significant. You will remember that Bob
proposed what he called "content standard
scores," scores that could be interpreted by referring
to small groups of illustrative test questions. So far as I
have been able to determine, no test publisher has
taken up Bob's challenge, but I think it would be
possible, using the procedure outlined by Ebel, to
develop reference sets of questions for most
survey-type tests that now provide only norm-referenced
interpretations. The sooner we get going on this task,
the better.

Another way of solving the problem of
misinterpretation may be to have the expert provide more
direct interpretation as part of the score report. Of
course, the final decisions need to be at the local
level, by teachers and parents and pupils who have
access to much more information about the individual
and the learning situation than can ever be reflected in
accumulated test data. However, just as the physician
is coming to rely more and more on interpretations of
test data provided by specialists or on interpretations
generated after referring patterns of test scores to the
information stored in the memory of a computer, so
we may find it possible to make use of the computer
to generate interpretative statements that provide an
interface between the test scores and the user.

Teaching a computer to do this in a manner and
with the qualifications necessary to insure that verbal
reports do not become the "modern Gospel" just as
grade equivalent scores have been in the past will be
quite a challenge, but we are making a beginning.
Tomorrow I will be reporting the results of the efforts
of a team working in Iowa City last summer under the
leadership of Professor Walter Mathews of the
University of Mississippi. Before too long, I hope to be
able to report on a field testing of a pilot system of
computer-generated verbal score reports for the Iowa
Tests of Basic Skills.



INTERPRETING GROUP SCORES 

When we turn from the interpretation of individual
scores to the interpretation of scores for groups, we are
faced with new problems. By abandoning grade
equivalent scores we might get rid of the problem reflected
in the newspaper report that "one-fourth of the pupils
in grade six in Port City are two years retarded in
reading." But whatever system is used, any attempt to
make a direct interpretation of average scores for Port
City will show that Port City's test performance is low.
It's only a step from there to the inference that there
must be something wrong with Port City's schools.

Now let's be clear on one point. Assuming we have
some sort of criterion interpretation available, the
actual performance of the pupils in Port City, or in a
particular school in Port City or in a particular
classroom in a particular school in Port City, will reflect
clearly the educational problem faced by the city. The
instruction needs to begin where the pupils are, and one
needs to have information about where the pupils are
if one is to design an effective educational program for
them. But if the test results are to be used for purposes
of accountability, the test scores must be placed in
context. Some way must be found for answering the
question, "Given all we know about the situation in
Port City, how do these results stack up?"

Right now the most popular way of reporting test
results for a school system is in terms of national
norms. I'm afraid this is because such a comparison
keeps the fun in testing for all concerned. Half of the
systems can report that they're OK since they are
above the norm and the other half can report that of
course they aren't up to norm, but then look at how
different the system is from the average system in the
country. Everybody wins and nobody has to ask the
hard question.

REPORTS AVAILABLE

Back issues of Measurement in Education are
available at 35¢ each in quantities of 25 or more
for a single issue.

Vol. 1, No. 1  Helping Teachers Use Tests by
               Robert L. Thorndike
        No. 2  Interpreting Achievement Profiles:
               Uses and Warnings by Eric F. Gardner
        No. 3  Mastery Learning and Mastery
               Testing by Samuel T. Mayo
        No. 4  On Reporting Test Results to
               Community Groups by Alden W. Badal
               & Edwin P. Larsen
Vol. 2, No. 1  National Assessment Says by Frank
               B. Womer
        No. 2  The PLAN System for Individualizing
               Education by John C. Flanagan
        No. 3  Measurement Aspects of Performance
               Contracting by Richard E. Schutz
        No. 4  The History of Grading Practices by
               Louise Witmer Cureton
Vol. 3, No. 1  Using Your Achievement Test Score
               Reports by Edwin Gary Joselyn &
               Jack C. Merwin
        No. 2  An Item Analysis Service for
               Teachers by Willard G. Warrington
        No. 3  On the Reliability of Ratings of
               Essay Examinations by William E.
               Coffman
        No. 4  Criterion-Referenced Testing in the
               Classroom by Peter W. Airasian and
               George F. Madaus
Vol. 4, No. 1  Goals and Objectives in Planning
               and Evaluation: A Second Generation
               by Victor W. Doherty and
               Walter E. Hathaway
        No. 2  Career Maturity by John O. Crites
        No. 3  Assessing Educational Achievement
               in the Affective Domain by Ralph
               W. Tyler
        No. 4  The National Test-Equating Study
               in Reading (The Anchor Test
               Study) by Richard M. Jaeger
Vol. 5, No. 1  The Tangled Web by Fred F.
               Harcleroad

For most widely used batteries of achievement tests,
the publishers have provided norms for various
subgroups in addition to national norms: regional norms,
norms for large cities, norms by IQ levels, and the like.
But these don't really solve the problem of
accountability. The logical end point of resort to differential
norms is that the appropriate norm for system X is the
results for system X since system X is unique. Besides,
as the variety of norms proliferates, the task of
interpretation becomes more complex. School
administrators and politicians and the man in the street can't
be blamed for feeling that the testers are trying to
obscure the meaning of the test scores. Some way must
be found to provide a simpler way of evaluating the
test results for a school system.



A CASE FOR ADJUSTED SCORES 

Henry Dyer has proposed an evaluation model based
on the application of multiple regression techniques
(1970). The model has much to recommend it since it
does take into account some of the variables that seem
to account for differences among school systems but
which school systems can't do much about. As Forsyth
(1972, 1973) has pointed out, there are still some
sticky problems in applying the Dyer model, but it is a
step in the right direction. The fact that the model is
being proposed for use in New York City and is
apparently acceptable to the teachers (Shanker, 1973)
is certainly encouraging.

Application of the Dyer model has the effect of 
substituting for the scores actually obtained, scores 
that have been adjusted for differences in the charac- 
teristics included in the model. This is not the only 
instance of efforts to develop adjusted scores. 
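
As a rough illustration of what "adjusted scores" means here, the sketch below fits a least-squares line relating school-system mean scores to one background characteristic, then reports each system's score with the fitted effect of that characteristic removed. It is a minimal sketch under stated assumptions, not the Dyer model itself: the model described above uses multiple regression over several variables, whereas this example uses a single predictor for brevity, and the function name `adjusted_scores`, the "background index," and the sample figures are all hypothetical.

```python
def adjusted_scores(observed, background):
    """Return scores adjusted for one background characteristic.

    observed:   mean test score for each school system
    background: value of a characteristic the system cannot control
                (a hypothetical index; higher means more advantaged)

    Each adjusted score is the observed score minus the part that the
    fitted regression line attributes to the system's departure from the
    average background value.
    """
    n = len(observed)
    if n != len(background) or n < 2:
        raise ValueError("need two or more systems, with matching lists")

    mean_x = sum(background) / n
    mean_y = sum(observed) / n

    # Ordinary least-squares slope for a single predictor.
    sxx = sum((x - mean_x) ** 2 for x in background)
    if sxx == 0:
        raise ValueError("background values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(background, observed))
    slope = sxy / sxx

    return [y - slope * (x - mean_x) for x, y in zip(background, observed)]


# Hypothetical data for three systems A, B, and C.
raw = [12, 17, 16]
index = [1, 2, 3]
print(adjusted_scores(raw, index))  # -> [14.0, 17.0, 14.0]
```

In the sample data, system C's raw score sits close to B's, but C also has the most favorable background index. After adjustment, A and C come out even and B stands clearly above both, which is the kind of comparison the adjusted-score approach is meant to support.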

In the summer and early fall of 1971 I had the
opportunity to meet with a group under the leadership
of Dr. Selma Mushkin of Georgetown University that
was asked by the U. S. Office of Education to look
into the feasibility of developing systems of adjusted
test scores for reporting summaries of test results for
school systems. I discovered that in many areas,
statistical reports consist of numbers that have been
adjusted to take into account differences from group
to group that might obscure the real meaning of the
data. For example, raw death rates for cities are not
directly comparable because of differences in the age
distribution from one city to another. To answer the
question, "Which city has the lowest death rate, taking
into account the age distribution in the population?"
adjusted death rates are reported.
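
The adjustment described for death rates can be sketched as direct age standardization: each city's age-specific death rates are applied to one shared reference age distribution, so that differences in age mix no longer drive the comparison. The sketch below is illustrative only. The two cities, the two age bands, the counts, and the function names are hypothetical, and the text does not say which standardization method the statistical reports in question used.

```python
def crude_rate(deaths, population):
    """Raw death rate: total deaths divided by total population."""
    return sum(deaths) / sum(population)


def age_adjusted_rate(deaths, population, standard):
    """Directly standardized death rate.

    deaths, population: counts per age band for one city
    standard:           reference population per age band, shared by
                        every city being compared
    """
    total_standard = sum(standard)
    expected_deaths = sum((d / p) * s
                          for d, p, s in zip(deaths, population, standard))
    return expected_deaths / total_standard


# Hypothetical figures; age bands are (younger, older).
standard_pop = [7000, 3000]
city_a = {"deaths": [9, 50], "population": [9000, 1000]}    # young city
city_b = {"deaths": [10, 200], "population": [5000, 5000]}  # older city

for name, city in (("A", city_a), ("B", city_b)):
    raw = crude_rate(city["deaths"], city["population"])
    adj = age_adjusted_rate(city["deaths"], city["population"], standard_pop)
    print(f"City {name}: raw {raw * 1000:.1f}, adjusted {adj * 1000:.1f} per 1,000")
```

With these made-up figures City A has the lower raw death rate only because its population is younger. Once both cities are weighted by the same age distribution, City A's rate is the higher one.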

More recently, as a member of the Analysis
Committee for the National Assessment Project, I've had
the opportunity to observe how statisticians like John
Tukey and Fred Mosteller go about the business of
adjusting statistical data to reduce the likelihood that
they will be misinterpreted.

I doubt that we can ever produce reports that are
completely resistant to misinterpretation, but I do
think that much more can be done to produce reports
less subject to misinterpretation. If the resolution of
the NEA has the effect of speeding up the
development of such reporting systems, it will have had a
salutary effect.



ONE KIND OF MORATORIUM 

I guess I haven't been saying very much about what
kind of moratorium I think is in order, have I? I've
been talking about the need for more training in
testing and evaluation for teachers and administrators,
about more emphasis on criterion-referenced
interpretations of test scores, about the need to report test
results in ways that are meaningful to those for whom
the reports are intended, about the possibilities of
adjusting summary statistics for differences in inputs. I
could also have talked about the need for developing
measures of a whole lot of additional variables to
insure that our evaluations are comprehensive, but
Jack Merwin did that for us last year (1973) and all I
need to add is "Amen."

I've questioned the desirability of calling a complete
moratorium on standardized testing in the schools: to
interrupt the data collection process while we perfect
our evaluation system is to create a critical information
gap. But I see nothing wrong at all with encouraging a
moratorium on the use of test scores to label children
rather than to guide their learning, to classify teachers
rather than to identify points where teachers may be
helped to become more effective, to pull the wool over
the eyes of the public rather than to generate questions
about how a school system might go about doing an
even better job. Let's not spend too much time
deploring the NEA's resolution; let's get on with the
business of meeting their demands for better tests,
better reporting systems and wiser test use.